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Abstract —We present the collaborative Kalman filter (CKF), 
a dynamic model for collaborative filtering and related fac¬ 
torization models. Using the matrix factorization approach to 
collaborative filtering, the CKF accounts for time evolution 
by modeling each low-dimensional latent embedding as a 
multidimensional Brownian motion. Each observation is a 
random variable whose distribution is parameterized by the 
dot product of the relevant Brownian motions at that moment 
in time. This is naturally interpreted as a Kalman filter with 
multiple interacting state space vectors. We also present a 
method for learning a dynamically evolving drift parameter 
for each location by modeling it as a geometric Brownian 
motion. We handle posterior intractability via a mean-field 
variational approximation, which also preserves tractability 
for downstream calculations in a manner similar to the 
Kalman filter. We evaluate the model on several large datasets, 
providing quantitative evaluation on the 10 million Movielens 
and 100 million Netfiix datasets and qualitative evaluation on 
a set of 39 million stock returns divided across roughly 6,500 
companies from the years 1962-2014. 

1. Introduction 

Collaborative filtering is a general and effective method 
for making pairwise predictions based on historical data. It 
is most frequently associated with recommendation systems, 
where the model learns user preference by considering 
all previous user/object interactions. User rating systems 
are widespread online. For example, Netfiix lets users rate 
movies on a five star rating system, Amazon allows users 
to rate products and YouTube allows users to “like” or 
“dislike” videos on their website. In cases such as these, 
a major objective is to recommend content to users in 
an intelligent way. Collaborative filtering addresses this 
problem by allowing the preferences of other users to inform 
the recommendations made for a user of interest ca. 

Though typically used for predicting user responses, the 
main idea behind collaborative filtering can be broadened to 
include other situations where the outcome of an event is 
predicted using information about the outcomes of all other 
events in a dyadic setting. For example, in sporting events 
one might want to predict whether team A will beat team 
B by using the results of all other games up to that point. 
Or one might wish to estimate the probability that batter 
X will get a hit against pitcher Y at a specific moment 
in time using all previous at-bats for every batter/pitcher 
combination. In other problem domains, one might wish to 
predict the stock prices or exchange rates in a way that takes 


into consideration the patterns of all other measurements up 
until that point. 

An effective approach to collaborative filtering is via 
matrix factorization m, CD. These methods embed users 
and objects into a low-dimensional latent space and make 
predictions based upon the relevant dot products. Such 
methods have proved very powerful since they are related to 
the low-rank matrix completion problem, wherein it is shown 
that a sparsely sampled low-rank matrix can be exactly 
recovered in ideal cases, or very closely otherwise O. In the 
Bayesian setting, treating missing data is efficiently handled 
by the joint likelihood function, which allows for simple 
closed form updates to the latent embeddings ifT^ . 

One common assumption in matrix factorization ap¬ 
proaches is that the model is “static,” meaning that every 
user or object has a latent location that is fixed in time and 
the model learns this single point using all data in a “batch” 
setting. Therefore, these models assume, e.g., that a user’s 
taste in movies is unchanging, or a team’s sequence of wins 
and losses is a stationary process. Though performance of 
such models can be very good, this assumption is clearly 
untrue and models that account for dynamic information are 
more appropriate in principle. 

In this paper we present a dynamic collaborative filter¬ 
ing model for temporally evolving data for applications 
including recommendation systems, currency exchange rate 
tracking, student/question tutoring systems and athletic com¬ 
petition prediction, among others. The model is inspired by 
the matrix factorization approach to collaborative filtering, 
where each object has a location in a latent space. However, 
rather than fixing these locations, we allow them to drift 
according to a multidimensional Brownian motion m Since 
we are motivated by real-time prediction, we only process 
each observation in the data stream once. Therefore our 
problem is ideally suited to large data sets. For inference 
we introduce a variational approximation 06) since the 
posterior distribution of model variables given an event is 
not in closed form; this allows for closed form updates 
and preserves tractability for future calculations. The model 
thus learns a time-evolving probability distribution on latent 
user/object locations. From the perspective of a single loca¬ 
tion, the proposed model can be interpreted as a state-space 
model with dynamics similar to a Kalman filter CD, and so 
we call our approach the collaborative Kalman filter (CKF). 

In addition, for problems such as stock modeling the 


drift parameter of the latent Brownian motions cannot be 
considered to be a fixed value because of the changing 
volatility of the market. We therefore further develop the 
CKF by modeling the drift parameter of each latent location 
with a geometric Brownian motion. This approach will re¬ 
quire approximations for tractable inference that nevertheless 
provide good and meaningful results as shown on a large 
stock dataset. 

We organize the paper as follows: In Section]^ we review 
the basic matrix factorization approach to collaborative 
filtering and the Kalman filter for time evolving linear 
modeling. In Sectionjl^we present the collaborative Kalman 
filtering model. We derive a variational inference algorithm 
for this model in Section and show experimental results 
in Section El 


II. Background 


Before presenting the collaborative Kalman filter (CKF), 
we review the basic matrix factorization approach to col¬ 
laborative filtering and the Kalman filter for dynamic state 
space modeling. These two approaches form the basis for 


our model, which we present in Section III 


A. Collaborative filtering with matrix factorization 

Generally speaking, the matrix factorization approach to 
collaborative filtering models an incomplete MxN matrix Z 
as a function of the product of two matrices, 3. KxM matrix 
U and Si K X N matrix W Q. The columns of U (denoted 
Ui) represent the locations of each user in a latent space 
and the columns of W (denoted Wj) represent the locations 
of each object in the same latent space. In the probabilistic 
approach to collaborative filtering using matrix factorization, 
the observed values in Z have a distribution that depends 
on the dot products of U and W, Zij ^ p{ufwj). For 
example, when Z is binary, pf) could be the logistic or 
probit link functions, when m-ary it could be the ordered 
probit model, and when real-valued it could be a univariate 
Gaussian. After learning U and W based on the observed 
data, predictions of Zij for the unseen pair (i, j) can be made 
using the distribution function p{ujwj). 

To see how matrix factorization can be interpreted as a 
collaborative filtering technique, consider the case where 
Zij ^ N{ufwj,a‘^). Using a maximum likelihood inference 
approach for U and W, the maximum likelihood update of 
Ui is a least squares vector using the relevant Wj and Zij . If 
we let the set contain the index values of objects that 
user i has rated, this update is 

Ui= ZijWj) ■ ( 1 ) 

Therefore, the design matrix for Ui uses the locations of all 
objects Wj rated by user i. The update for each Wj parallels 
this update for Ui using the relevant user locations, and so 
the locations of all other users indirectly influence the update 
for user i in this way. 


A large number of collaborative filtering methods can 
be viewed as variations and developments of this basic 
model. A potential drawback of such approaches is that 


will be to show how this model can be easily extended to 
model the temporal evolution of Ui and Wj using Brownian 
motion. Since the resulting model has dynamics similar to 
the Kalman filter, we briefly review this method next. 

B. The Kalman filter 

As mentioned, a drawback of many collaborative filter¬ 
ing techniques such as those that involve updates similar 
to Equation 0 is that no temporal dynamics are being 
modeled. When viewing the locations of a user Ui or 
object Wj as being in a state space, the Kalman filter 
naturally arises as the first place to look for developing a 
dynamic representation for matrix factorization models. In 
this section we provide a brief review of the Kalman filter 
113 , focusing on the equations and using parameterizations 
that are relevant to our CKF model. 

The Kalman filter models a sequence of observed vectors 
Un G W as linear functions of a sequence of latent state 
vectors Wn G with additive noise. These state vectors 
evolve according to a first-order Markov process, where the 
current state equals the previous state plus additive noise. 
Assuming a Gaussian prior distribution on wq, then for n = 
1,..., A" and zero-mean noise, this is written as follows, 

Wn+l\Wn ~ N{Wn,aI), ( 2 ) 

yn+l\Wn+l ~ N{AWn+l,a'^I). ( 3 ) 

Inference for the state evolution of w proceeds according 
to the forward algorithm, which includes two steps: (i) 
marginalizing the previous state to obtain a prior distribution 
on the current state, and (ii) calculating the posterior of the 
current state given an observation]^ 

Specifically, let the distribution of the current state be 
Wn ^ N{pn^^n)- The distribution of the next state Wn-\-i 
is calculated by marginalizing the current state, 

p{Wn+i) = / p{Wn+l\Wn)p{Wn)dWn 

= N{Wn+l\i>'n,aI+ T,n). ( 4 ) 

The drift parameter a controls the volatility of the state 

vectors and is a measure of drift in one unit of time. (In 
this paper, we present a method for learning a dynamically 
evolving value for a.) 

After observing Pn+i, the posterior of Wn-\-i is 

p{Wn+l\yn+l) OC piUn+llWnWPiWn+l) 

= ( 5 ) 

^Because we are only interested in real-time prediction in this paper, we 

do not discuss the backward algorithm used for Kalman smoothing. 


they do not use temporal information. A goal in Section 


III 




where, after defining = a/ + 


S„+1 = {A^A/a^ + B-^)-\ (6) 

A^n+l = S„+i(A^ 2//(7^ + B“Vn)- (7) 

Following Equation we can use this posterior distribu¬ 
tion on Wn+i to find the prior for the next state Wn+ 2 - This 
also indicates how jin and were obtained for step n, and 
so only an initial Gaussian distribution on wq is required to 
allow for analytical calculations to be made for all n. 

The extension to continuous time requires a simple mod¬ 
ification: For continuous-time Kalman filters, the variance 
of the drift from Wn to iCn+i depends on the time between 
these two events. Calling this time difference Atn, we can 
make the continuous-time extension by replacing al with 
AtnOil. Therefore re is a Brownian motion 0. 

C Related work 

A similar model that addressed spatio-temporal infor¬ 
mation was presented in m A fixed-drift version of the 
model presented here was originally presented at a machine 
learning workshop Co) and is most closely related to ca, 
ca. Because the models share the same name, we discuss 
some important differences before presenting our version, 
and also to highlight our contributions: 

• Our approach does not require Kalman smoothing. 
Therefore, we do not store the sequence of state vectors, 
which allows for our model to scale to massive data 
sets. The model in flSj , flAj is only designed to handle 
small-scale data problems. 

• We derive a variational inference algorithm for learning 
time-evolving distributions on the latent state space 
vectors akin to the Kalman filter. We also use a 
geometric Brownian motion for learning a dynamic 
drift parameter. Our experiments will show that a 
fully Bayesian approach on the state space improves 
predictive ability and learning a dynamic drift gives 
significant information about the underlying structure 
of the data set. 

• We derive a probit model for binary and ordinal obser¬ 
vations such as movie ratings instead of directly model¬ 
ing all data types with Gaussians. We will show that this 
approach gives an improvement in performance when 
compared with modeling, e.g., discrete star ratings as 
real-valued Gaussian random variables. 


III. Collaborative Kalman Filtering 

The motivation behind the collaborative Kalman filter 
(CKF) is to extend the matrix factorization model described 
in Section [IFA| to the dynamic setting. The continuous-time 
Kalman filtering framework discussed in Section |II-B| is a 


natural means for doing this. The CKF models each latent 
location of a user or object as moving in space according 
to a Brownian motion. In our case, these correspond to the 
sets of vectors {ui} and {wj}, which are now functions of 


time. At any given time t, the output for dyad (i, j) uses the 
dot product {ui[t] ,Wj[t]) as a parameter to a distribution 
as discussed in Section III-AI 

In the following discussion of the model we focus on 
one event (e.g., rating, trial, etc.), which can then be easily 
generalized to account for every event in a way similar to the 
Kalman filter. We also present the model in the context of 
m-ary prediction using ordered probit regression 0. For this 
problem, the real line M is partitioned into m regions with 
each region denoting a class, such as star rating. Let Ik = 
{Ik, Vk] be the partition for class k where h < '^k = 

Ik = Vk-i and k = 1 ,. .. ,m. The output for pair at 
time t, denoted Zij[t] G {1,..., m}, is obtained by drawing 
from a Gaussian distribution with the mean {ui[t] ,Wj[t]) 
and some preset variance and finding which partition it 
falls within. Therefore, the model assumes an order relation 
between the m classes, for example that a 4 star rating is 
closer to a 5 star rating than a 1 star rating. (In the binary 
special case, this order relation no longer matters.) We give 
a more detailed description of the model below. 

Likelihood model: Let Ui[t] G represent the la¬ 
tent state vector for object i at time t and similarly for 
Wj[t] G Using the partition X and the probit function, 
the probability that Zij[t] = k is 


P{zij[t] = k\ui,Wj) = / N {y\{ui[t],Wj[t]),a‘^) dy. 
Jik 

( 8 ) 


For inference we introduce the latent variable yij[t] and 
expand Equation <0 hierarchically. 


Zij[t]\yij[t] = eXfc), (9) 

yij[t]\ui,Wj ~ N{{ui[t] ,Wj[t]),a^). (10) 


We observe that, unlike most collaborative filtering models, 
there is no assumption that the pair (i,j) only interacts 
once, which has practical advantages. For example, teams 
may play each other multiple times, and even in a user 
rating setting the user may change his or her rating at 
different points in time, which contains valuable dynamic 
information. In related problems such as stock modeling, 
updated values arrive at regular intervals of time|^ 

Prior model: At time t we are interested in posterior 
inference for state vectors Ui[t] and Wj[t]. This requires 
a prior model, which as we’ve discussed we take to be a 
latent multidimensional Brownian motion. Let the duration 
of time since the last event be for Ui and for 
Wj. Also, as with the Kalman filter, assume that we have 
multivariate normal posterior distributions for Ui[t — ] 

and Wj[t — ], with t — being the time of the 

last observation for Ui and similarly for Wj. We write these 


^As we will see, in this case y is observed and the likelihood is dl0|. 







distributions as 

udt - - A»' ]), 

Wj[t- A'^> ] - - A'^> ], s;,[t - A'^' ]). (11) 

The posterior parameters fi' and TI are also dynamically 
evolving and indexed by time. As is evident, we are assum¬ 
ing that we have independent posteriors of these variables. 
Though this is not analytically true, our mean-field varia¬ 
tional inference algorithm will make this approximation and 
so we can proceed under this assumption. 

By the continuous-time extension of Equation (0, we can 
marginalize Ui[t — ] and Wj[t — ] in the interval 

to obtain the prior distributions on Ui and Wj at time t, 

Ui[t] - 

Wj[t] - (12) 

where 

= ij'[t - A'*' ], I][t] = I]'[i - A'*’ ] + A'*' al, 

for the respective Ui and Wj. We note that we use an 
apostrophe to distinguish priors from posteriors. 

We see that we can interpret the CKF as a Kalman filter 
with multiple interacting state space vectors rather than a 
known and fixed design matrix A as in Equation 0- We also 
recall from the Kalman filter that al is the drift covariance 
in one unit of time, and so the transition from posterior to 
prior involves the addition of a small value to the diagonal 
of the posterior covariance matrix. This is what allows for 
dynamic modeling; when a = 0, this is simply an online 
Bayesian algorithm and the posteriors will converge to a 
delta function at the mean. 

Hyperprior model: The drift parameter a controls how 
much the state vectors can move in one unit of time. When a 
becomes bigger, the state vectors can move greater distances 
to better fit the variable y, also making these vectors more 
forgetful of previous information. When a ^ 0 the model 
does not forget any previous information and the state 
vectors simply converge to a point since the posterior distri¬ 
bution becomes more concentrated at the mean. Therefore, 
a is clearly an important parameter that can impact model 
performance. 

We develop the CKF model by allowing a to dynamically 
change in time as well. This provides a measure of volatility 
that will be especially meaningful in models for stock prices, 
as we will show. For notational convenience, we define the 
hyperprior for a shared a, but observe that extensions to a 
Ui- and rcj-specific a is straightforward (which we discuss 
in more detail in Section ||V]). 

We model a as a geometric Brownian motion by defining 
a[t] = and letting a[t] be a Brownian motion. 

Therefore, as with the state vectors, if t — is the last 
observed time for a[t], the distribution of a[t] is 

a[t] ~ N{a[t - Ay ], cA ^). (13) 


Again there is a drift parameter c that plays an important 
role and requires a good setting, but we observe that defining 
a to be an exponentiated Brownian motion has an important 
modeling purpose by allowing for volatility in time, and is 
not done simply to avoid parameter setting. 

When a is a geometric Brownian motion, the transition 
from posterior to prior for Ui and Wj needs to be modified. 
In Equation ( p^ the constant value of a allowed for T^[t] 
to be calculated by adding a time-scaled al to the previous 
posterior. In this case, the update is modified for, e.g., u by 
performing the integration implied in Equation (12), 

- A'„*' ] + / ds. (14) 


We will derive a simple approximation of this this stochastic 
integral for inference. 


IV. Variational Inference for the CKF 

Having defined the model prior, we next turn to posterior 
inference. Since the CKF model is dynamic, we treat poste¬ 
rior inference in the same way by learning a time-evolving 
posterior distribution on the underlying vectors {ui} and 
{wj} (and the hidden data yij[t] when the observations are 
from a latent probit model). We also learn the Brownian 
motion controlling the drift of each latent state vector, a[t], 
using a point estimate. We therefore break the presentation 
of our inference algorithm into two parts: One dealing with 
u, w and y, and one dealing with a. We present an overview 
of the inference algorithm in Algorithm 

A. An approximation for the hyperprior model 

Before discussing our inference algorithm, we first present 
an approximation to the stochastic integral in Equation 
that will lead to efficient updates of the Brownian 
motion a[t]. To this end, we first introduce the following 
approximation, 

(15) 

We recall from the description of the model that a[t] can 
be shared or vector-specific. That is, e.g., a[t] could be 
unique for each Ui, or shared among {ui} (in which case an 
appropriate subscript notation for a could be added). From 
the perspective of a generic multidimensional Brownian 
motion, the generative structure for this approximation is 

u[t] ~ A„],I];[t-A„] +e“'*’A„J), 

a[t] ~ N{a[t - Aa] ,cAa). (16) 

That is, with this approximation we first draw the log drift 
value at time t, a[t], according to its underlying Brownian 
motion. We then approximate the drift parameter as being 
constant in the unobserved interval. Clearly, as the intervals 
between observations become smaller, this approximation 
becomes better. For inference we will learn a point estimate 
of the Brownian motion a[t]. 



Algorithm 1 CKF parameter updates at time t 


1 

2 

3 
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5 
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To update the parameters of dyad at time t: 
for iteration = 1,..., T do 

Update q{yij[t ]) as in Equation (24). 

Update q{ui[t]) as in (26). 

Update q{wj[t ]) by modifying (26) as noted in text. 
Update Ou^it] as in ( [27| ). 

Update ay^.[t] by modifying ( [27| ) as noted in text. 

end for 

Notes: The following changes can be made. 

• When y is observed, skip Step 3 and directly use y in 

Steps 4 and 5. 

• When modeling a shared Ou among all {u}, the update 

is changed as noted in text, and similarly for w. 


B. A variational approximation 

Focusing on the latent state-space variables first, the 
posterior of an event at time t is 

p{ui[t] ,yij[t] \zij[t],aui,ay,.) oc (17) 

\yij)p{yij[t'\ \ui,Wj)p{ui[t] \aui)p{wj[t] la^.) 

Unlike the Kalman filter the full posterior of the CKF 
is not analytically tractable. Typically when this occurs 
with nonlinear Kalman filters sampling methods such as 
sequential Monte Carlo are employed as an approximation 
i). This is not computationally feasible for our problem 
since we are learning the evolution of a large collection of 
latent state vectors. Instead we use the deterministic mean- 
field variational method for approximate posterior inference 
ca. As we will show, this not only allows for tractable 
posterior calculations for the state vectors u, w and hidden 
data y, but also allows for a transition from prior to posterior 
similar to the Kalman filter that retains tractability for all 
downstream calculations. 

We approximate the posterior of Equation with the 
following factorized distribution. 


q{ui[t] ,Wj[t] ,yij[t]) = q{ui[t] )q{wj[t] )q{yij[t ]). 

( 18 ) 

We derive the specific distributions in Section |IV-C| This 
q distribution approximates the posterior of each variable 
as being independent of all other variables, which has a 
significant advantage in this dynamic setting. Having defined 
q, we seek to minimizes the KL divergence between q and 
the true posterior at time t by equivalently maximizing the 
variational objective function ca, 


C. Variational q distribution for the prior model 

As indicated in Equation ( p^ , constructing the variational 
objective function for the event at time t involves taking 
the expectation of the log joint likelihood and adding the 
entropies of each q. This requires defining the distribution 
family for q{yij), q{ui) and q{wj). The optimal form for 
each q distribution can be found by exponentiating the 
expected log joint likelihood holding out the q distribution 
of interest (161: qi ex exp{Eg_Jlnp(-)]}. 

Temporarily ignoring the time notation, at time t we can 
find q{yij[t] ) by calculating 

qiVij) oc eyip{\np{zij\yij) + ^q\^np{yij\ui,Wj)]}. (20) 

Since p^Zijit] \yij[t]) = I{yij[t] e ), it follows that 
qiVijit ]) is defined on the interval Tzij[t ], which is the cell 
corresponding to class Zij[t]. After taking the expectation, 
the optimal q distribution is found to be 

qiyijW ) = [yijlt] \{EqUi[t] ,EqWj[t] ),a^). 

(21) 

This is a truncated normal distribution on the support 
with mean parameter {EqUi [t] ,'EqWj[t]) (defined later) and 
variance cr^. 

We can follow the same procedure to find q{ui[t]) for 
time t. Again ignoring the time component we have 

q{ui) oc exp{Eg[lnp(yij|ui,M;j)] +lnp(ui)}. (22) 

The prior on Ui[t] is discussed below. Solving for this 
distribution shows that q{ui[t]) is a multivariate Gaussian, 

qiudt]) = N [udt] ) , (23) 

with mean and covariance parameter given below. Symmetry 
results in the same q distribution family for Wj[t]. 

D. A coordinate ascent algorithm for q{u), q{w) and q{y) 

We next present a coordinate ascent algorithm for updat¬ 
ing the three q distributions that appear at time t. 

Coordinate update of q{y)\ The analytical update for the 
mean parameter mij[t] of q{yij[t] ) is 

mij[t] = {EqUiit] ,EqWj[t]). (24) 

The values of EqU^t] and EqWj[t] are given below. The 
expectation of yij [ t ] depends on the interval in which it falls 
according to the truncated normal. Let lzij[t] and be 

the left and right boundaries of this interval. Then defining 

^ W«i -rriijW ^ ^ ^ ^ 

(J (J 


Ct = Eq\hip{zij[t] ,Ui[t] ,Wj[t] ,yij[t] )]-Eq[lng]. (19) 

Since this is a non-convex function, a local maximum of Ct 
is found by optimizing the parameters of each q as discussed 
in Section HVd] 


we have that 


EqPijit] =mij[t] +a 




(25) 


where </>(■) is the pdf and $(■) the cdf of a standard normal. 















Coordinate update of q{u)\ The updates for mean 
and covariance of q{ui[t]) are 

K. = fzi + , 

^^Ui ~ ^Ui (^q[yij]t^Wj/^ • ( 26 ) 

All parameters above are functions evaluated at t. We recall 
that the prior mean ~ ^uj] prior 

covariance ~ “ ^w-] + /, which 

follows from the approximation discussed above. 

Coordinate update of q{w)\ Because of symmetry, the 
updates for .[t] and .[t] of q{wj[t]) are similar to 
those for Ui, but with the indices Ui and Wj switched on all 
means and covariances. We therefore omit these updates. 


E. Inferring the geometric Brownian motion expjaft7} 

We next derive an algorithm for inferring the drift process 
a[t] used in the geometric Brownian motion. We derive a 
point estimate of this value assuming individual a for each Ui 
and Wj. To update, e.g., Ouft] we approximate the relevant 
terms in jC using a second order Taylor expansion about the 
point Oui [ t — ], which is the last inferred value for this 

process. We recall that the form of this approximation for a 
general function f{x) is 

f{x) « f{xo) + {x- xo)f{xo) + \{x- Xoff"(xo), 

with the solution at the point x = xq — f"{xo)~^ f'{xo). In 
our case, / = Eq[\np{ui[t] .Ouft ])]. 

We can efficiently update Ouft] using this Taylor ex¬ 
pansion as follows: Let = QAQ^ be the 

eigendecomposition of the posterior covariance of Ui at the 
previous observation. (This needs to be done only once for 
the updates at this time point.) Also, let 

V = {PuW - ynW fQ, M = Q. 


Since this depends on the updated posterior mean and 
covariance of Ui, v and M should be recomputed after each 
update of q{ui). Let Xd be the dth eigenvalue of [t—A^^^^ ] 
and define 


Then using a second order Taylor expansion of the objective 
function at the previous point Ouft — A^^ , ] gives the 
following first and second derivatives with respect to ., 


f' — a[t]-a[t-Ag] 1 ^ Vi '^d+^d,d \ 

J ~ cAg 2 l^d'ld\^ Xd+e^-t^Ag)^ 

f" = -^-lY.dVd{i-Vd) 

+ (27) 


We have removed the subscript dependency on Ui above. 
We then use the closed form equation from the Taylor 
expansion to update Ouft]. The update for [t ] has the 

same form, but with the relevant variational parameters and 
eigendecomposition replacing those above. 


F. Simplifications to the model and predictions of new data 

When Uij[t] is observed only a slight modification needs 
to be made to the algorithm above. In this case q{y) is no 
longer necessary and in the update for each q{u) and q{w) 
the term EqPijit] can be replaced with the observed value. 

Predictions using the CKF require an approximation of 
an intractable integral. One approximation is to use Monte 
Carlo methods. Suppressing the time t, the expectation we 
are interested in taking is Vij := Eq[^{{ui,Wj)/a)] We can 
approximate Vij using S samples from these distributions, 

(28) 

S = 1 

where q{ui[t] ) and q{wj[t] ). This gives an 

unbiased estimate of the expectation. A faster approximation 
is to directly use the means from q for ufit] and Wj[t]. 

V. Experiments 

We evaluate the performance of the collaborative Kalman 
filter on the following three data sets: 

• The Netfiix data set containing 100 million movie 
ratings from 1999 to 2006. The movies are rated from 
1 to 5 and ratings are distributed across roughly 17K 
movies and 480K users. 

• The MovieLens data set containing 10 million movie 
ratings from 1995 to 2009. The ratings are given on a 
half star scale from 0.5 to 5 and are distributed across 
lOK movies and 7IK users. 

• Stock returns data measured at opening and closing 
times for 433 companies from the AMEX exchange, 
2,774 companies from NASDAQ and 3,273 companies 
from the NYSE for a total of 6,480 stocks and 39.1 
million total measurements from 1962-20140 

The model requires setting some important parameters: The 
dimension d of Ui and Wj, the standard deviation of the 
probit model a, the partition widths Vk — and the drift 
parameter c for the geometric Brownian motion a[t] . We 
discuss these below for the respective problems. 

A. Movie rating prediction 

We first present results on dynamic modeling of the 
Netfiix and MovieLens data sets. For these problems, we 
tried several different dimension settings {d = 10, 20, 30). 
We learned a converging shared among users and 
shared by movies by setting c = 0 since we do not 
assume there to be fundamental shifts in the overall user or 
movie landscape. We set a to be the value that minimizes 
the KL divergence between the probit and logistic link 
functions, which we found to be approximately a = 1.76. 
We randomly initialized the mean of the priors on each Ui 
and Wj at time zero, and set the covariance equal to the 

^Data downloaded from http://ichart.finance.yahoo.com/ 
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Figure 1. Cumulative total of movies rated by two users from the Netfiix 
data set. (left) This user rates large batches of movies in a few sittings. 
Though the dynamics won’t be captured by the model, we still can perform 
sequential inference for this user to make predictions, (right) A more 
incremental rating pattern that has dynamic value. 



(a) Netfiix (b) MovieLens 


Figure 2. Histograms of the total number of ratings within a month over 
the course of each data set. 


identity matrix. For the partition width, we set — //e = cr 
for Netfiix and r^—lk = cr/2 for MovieLens, which accounts 
for the half vs. whole star rating system. We compare with 
several algorithms: 

1) Online VB-EM: The non-drift special case of the CKF 
with exp{a[t] } = 0. 

2) Batch VB-EM: The batch variational inference version 
of (1) that uses all ratings when updating Ui and Wj. 

3) Batch MAP-EM: A version of probabilistic matrix 
factorization (PMF) im that uses the probit model 
as above. Point estimates of Ui and Wj are learned 
and the hidden data yij is inferred using EM. 

4) BPME: Bayesian PME ifT^ without a probit model. 

5) M^E: Mixed-membership matrix factorization ||9l. 
The zero-drift version of this model is an online sequential 
method for Bayesian inference that is similar to other “big 
data” extensions of the variational inference approach (H, 

The difference between this and the batch model is that 
we only process each rating once with the online algorithm, 
while with batch inference we iterate over users and movies 
several times processing all data in a single iteration, as is 
more typically done. 

The ability of our model to exploit dynamic information 
depends on how dynamic the data is. Eor example, in Eigure 

we show the cumulative number of ratings for two users 
as a function of time. With the first user, we don’t expect 


Table I 

RMSE RESULTS FOR THE NETFLIX 100 MILLION AND MOVIELENS 10 
MILLION DATA SETS. COMPARISONS SHOW AN ADVANTAGE TO 
MODELING THE DYNAMIC INFORMATION WITHIN THE DATA. 


Model 

Size 

Netfiix 

MovieLens 

CKE 

d=10 

0.8540 

0.7726 


d = 20 

0.8534 

0.7654 


II 

CO 

o 

0.8540 

0.7635 

Online VB-EM 

d=10 

0.8682 

0.7855 


d = 20 

0.8707 

0.7805 


II 

CO 

o 

0.8668 

0.7786 

Batch VB-EM 

d=10 

0.8825 

0.7996 


d = 20 

0.8688 

0.7896 


II 

CO 

o 

0.8638 

0.7865 

Batch MAP-EM 

d=10 

0.9277 

0.9133 


d = 20 

0.9182 

0.9113 


II 

CO 

o 

0.9143 

0.9133 

BPME 

II 

CO 

o 

0.9047 

0.8472 

M^F 

II 

CO 

o 

0.9015 

0.8447 


to have a dynamic benefit, but we do for the second user. 
However, as an online algorithm we note that the CKE still 
can make useful predictions in both cases; in the limiting 
case that all users rate all movies in a given instant, the 
CKE will simply reduce to the online zero-drift model. In 
Eigure we show the monthly ratings histogram for both 
data sets, which gives a sense of the dynamic information 
from the movie perspective. 

We use the RMSE as a performance measure, which we 
show for all algorithms in Table |T| Eor the batch algorithms, 
we randomly held out 5% of the data for testing to calculate 
this value. We ran multiple trials and found that the standard 
deviation for these algorithms was small enough to omit. 
(We discuss training/testing for the online algorithms below.) 
We observe that our algorithm outperforms the other base¬ 
line algorithms, and so dynamic modeling of user preference 
does indeed given an improvement in rating prediction. This 
is especially evident when comparing the CKE with Online 
VB, the only difference between these algorithms being the 
introduction of a drift in the state space vectors. 

We also observe the improvement of the variational frame¬ 
work in general. Eor example, by comparing Batch VB 
with PME EM, we see that variational inference provides 
an improvement over a MAP EM implementation, which 
models a point estimate of Ui and Wj. Both methods use 
a latent variable yij in a probit function for the observed 
rating, but the variational approach models the uncertainty in 
these state space vectors, and so we see that a fully Bayesian 
approach is helpful. We also found in our experiments that 
treating the rating Zij[t] as being generated from a probit 
model and learning a latent yij is important as well, which 
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Figure 3. The RMSE as a function of number of ratings for a user and 
movie. The (m, n) entry contains the RMSE calculated over user/movie 
pairs where the user has rated at least 10 (m — 1) movies and the movie 
has been rated at least 10(n — 1) times. The value shown in the lower-right 
corner is 200+ ratings each, which we use in Table [I| 



we observed by comparing PMF EM with the original PMF 
algorithm |[TT1l |^ 

Calculating the RMSE requires a different approach be¬ 
tween the online and static models. To calculate the RMSE 
for the two dynamic models—CKF and the online model— 
we do not use the test set for the batch models, but instead 
make predictions of every rating in the data set before using 
the observed rating to update the model. This gives a more 
realistic setting for measuring performance of the online 
models, especially for the CKF where we’re interested in 
predicting the user’s rating at that time. Therefore, we must 
choose which predictions to calculate the RMSE over, since 
clearly it will be bad for the first several ratings for a 
particular user or movie. By looking at the users and movies 
that appear in the testing sets of the batch models we found 
that for Netfiix each user in the test set had an average of 
613 ratings in the training set and each movie in the test 
set had 53,395 ratings in the training set. For MovieLens 
this was 447 per user and 7,157 per movie, meaning that in 
both cases a substantial amount of data was used to learn 
locations in training before making predictions in testing. 
Therefore, to calculate our RMSE of the online models we 
disregard the prediction if either the user or movie has under 
200 previous ratings. This arguably still puts the RMSE for 
our model at a disadvantage, but we noticed that the value 
didn’t change much with values larger than 200. We show 
the RMSE as a function of this number in Eigure 

In Eigure we show the dynamics of an individual user 
by plotting the predicted rating for a set of movies as a 
function of time. We can see an evolving preference in 
this plot-for example a strong interest in Batman Begins 
that then slightly decreases with time, while interest in 
The Matrix begins to increase toward the end. Also, while 
there’s some interest in Anchorman at the beginning, the 
user then quickly becomes uninterested. In most cases, we 

"^We omit these PME results in the table, which we note gave RMSE 
results over one as can be seen in the original paper. We also note that the 
PME algorithm is the non-dynamic version of oa. 



Month 


Figure 4. An example of a user drift over time as seen through predicted 
ratings for several movies from the Netfiix data set. The y-axis is the latent 
variable space, which we partition according to star rating as indicated. 


observed no clear evolution in movie preference for a user. 
We consider to be intuitively reasonable since we argue 
that few people fundamentally change their taste. However, 
being able to find those individual users or movies that 
do undergo a significant change can have an impact on 
learning the latent vectors for all users and movies since the 
model is collaborative, and so we argue that modeling time 
evolution can be valuable even when the actual percentage 
of dynamically changing behaviors is small. 


B. Stock Data 


We next present a qualitative evaluation of our model on a 
stock returns data set. Eor this problem, the value of yij [ t ] is 
observed since it is treated as the continuous-valued stock 
price, so we can model it directly as discussed in section 
mi We set the latent dimension d = 5, but observed 
similar results for higher values. We learned stock-specific 
drift Brownian motions and set Cu = 5 x 10“^, 

which we found to be a good setting through parameter 
tuning since the learned Ouft] were not too smooth, but 
still stable. We also observe that, for this problem, there is 
only one state vector corresponding to w, which we refer 
to as a “state-of-the-world” (SOW) vector. Therefore, this 
problem falls more within the traditional Kalman filtering 


setting discussed in Section |II-B[ Eor the SOW vector, we 
set Cw = 0 to learn a converging value of which we 

note was a^[t] ^ —11.7. Eor the noise standard deviation 
we set a = 0.01, which enforce that the state space vectors 
track yij[t] closely. (We again note that j = I for this 
problem.) We plot the number of active stocks by year in 
Eigure where we see that as time goes by, the number of 
stocks actively being traded increases significantly. 

We first assess the tracking ability of our model. The 















year 

Figure 5. The number of actively traded stocks as a function of time. 
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Figure 6. A histogram of the tracking errors of the CKF for four 
stocks using the mean of each q distribution (in log 2 scale). The tracking 
performance is very accurate using a 5 dimensional latent space (order 
10“^). These result are typical of all histograms. 


ability to accurately track the stock prices indicates that 
the latent structure being learned is capturing meaningful 
information about the data set. We show these results as 
error histograms on log 2 scale in Figure for four companies 
(tracking was very accurate and could not be distinguished 
visually from the true signal) and mention that these results 
are representative of all tracking performances. We see from 
these plots that the prediction errors of the stock prices are 
small, on the order of 10“^, and so we can conclude that 
our five dimensional state space representation for Ui and w 
is sufficient to capture all degrees of freedom in time across 
the 6,480 stocks. 

We next look more closely at the Brownian motions 
aui[t] learned by the model to capture the volatility of the 
stock prices. In Figure [TJa) we show the historical stock 
price for two oil companies, BP and Chevron. Below this 
in Figure [^b) we show the respective functions au- learned 
for these stocks. We see that the volatility of both of these 
oil companies share similar shapes since they are closely 



(b) Brownian motion volatility parameter au^it ]. 


Figure 7. (a) The historical stock price for two oil companies, BP 

and Chevron, (b) The log drift Brownian motion (au^it]) indicating the 
volatility of each stock. We see that though the stock prices are different, 
the volatility of both oil companies share the same shape since they are 
closely linked in the market. 


linked in the market. For example in the early 80’s, late 
90’s and late 2000’s, oil prices were particularly volatile. 
This is modeled by the increase of the geometric Brownian 
motion expja^ijt] }, which captures the fact that the state 
vectors are moving around significantly in the latent space 
during this time to rapidly adjust to stock prices. 

We also consider the stock volatility across different 
market sectors. In Figure we show the historical stock 
prices for five companies along the top row and their 
respective below them on the second row. Three of 

the companies are in the steel market, while the other two 
are from different markets (pharmaceutical and beverage). 
We again see that the learned Brownian motion captures a 
similar volatility for the steel companies. In each case, there 
is significant volatility associated with the 2008 financial 
crisis due to the decrease in new construction. This volatility 
is not found with the pharmaceutical or beverage companies. 
However, we do see a major spike in volatility for Coca-Cola 
around 1985, which was a result of their unsuccessful “New 
Coke” experiment. These observations are confirmed by the 
respective stock prices. However, we note that the level of 
volatility is not only associated with large changes in stock 
value. In the Pfizer example, we see that the volatility is 
consistently high for the first half of its stock life, and then 
decreases significantly for the second half, which is due to 
the significantly different levels of volatility before and after 
the year 2000. 
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Figure 8. (Top row) The historical stock prices for three steel companies, one pharmaceutical and one beverage company. (Bottom row) The corresponding 
log drift Brownian motions (au^it ]) for the respective stocks from the top row. We see that the three steel companies shared high volatility during the 
period of the 2008 financial crises, but companies in other areas such as the pharmaceutical and beverage industry were not similarly affected. 


VI. Conclusion 

We have presented a collaborative Kalman filter for dy¬ 
namic collaborative filtering. The model uses the matrix 
factorization approach, which embeds users and objects in a 
latent space. However, unlike most related models in which 
these locations are fixed in time, we allowed them to drift 
according to a Brownian motion. This allows for dynamic 
evolution of, e.g., user preference. We built on this basic 
model by presenting a method for learning a time-evolving 
drift parameter for the state space vectors using a geometric 
Brownian motion. 

Many other applications could potentially benefit from 
this framework. For example, we could predict results of bat- 
ter/pitcher events over time in baseball or winners of sporting 
competitions. Such a dynamic model would clearly be an 
advantage since player or team performance can experience 
streaks and slumps. Another potential use is in automated 
tutoring systems, where a student is asked a sequence of 
questions (e.g., math questions or language fiashcards). In 
a collaborative filtering environment we would wish to 
estimate the probability that a student answers a question 
correctly. Clearly as the student learns these probabilities 
will evolve. Having estimates for such a probability would 
allow for intelligent tutoring, where questions are selected 
that are not too easy or too hard and are able to help with 
the learning process. 
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